Detection of sentence boundaries and abbreviations in clinical narratives
Abstract
BACKGROUND: In Western languages the period character is highly ambiguous because it serves both as a sentence delimiter and as an abbreviation marker. This matters particularly in clinical free text, which is characterized by numerous anomalies in spelling, punctuation, and vocabulary and by a high frequency of short forms.

METHODS: The problem is addressed with two binary classifiers, one for abbreviation detection and one for sentence detection. A support vector machine with a linear kernel is trained on different combinations of feature sets for each classification task. Feature relevance ranking is applied to investigate which features matter for each task. The methods are applied to German-language texts from a medical record system, authored by specialized physicians.

RESULTS: Two collections of 3,024 text snippets were annotated with the role of each period character, for training and testing. Cohen's kappa for the annotation was 0.98. On the training set, 10-fold cross-validation yielded an unweighted micro-averaged F-measure of 0.97 for abbreviation and sentence boundary detection. On the test set, the unweighted micro-averaged F-measure was 0.95 for abbreviation detection and 0.94 for sentence delineation. Language-dependent resources and rules had less impact on abbreviation detection than on sentence delineation.

CONCLUSIONS: Sentence detection is an important task that should be performed at the beginning of a text processing pipeline. For the text genre under scrutiny, support vector machines with a linear kernel produce state-of-the-art results for sentence boundary detection. The results are comparable with those of other sentence boundary detection methods applied to English clinical texts. Abbreviation detection was identified as a supportive task for sentence delineation.
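To make the classification setup more concrete, the following is a minimal sketch of how period-terminated tokens might be turned into feature rows for the two binary classifiers. The abstract does not list the actual feature sets, so every feature name, the tiny `ABBREVIATION_LEXICON`, and the example sentence are assumptions chosen for illustration, not the authors' method.

```python
# Illustrative only: the feature names and the lexicon below are hypothetical,
# not the feature sets used in the paper.
ABBREVIATION_LEXICON = {"pat", "li", "re", "z.n", "bds", "ca"}


def extract_features(text):
    """Return one feature dict per whitespace token that ends with a period."""
    tokens = text.split()
    rows = []
    for i, token in enumerate(tokens):
        if not token.endswith("."):
            continue  # only period-terminated tokens are classification candidates
        stem = token[:-1]  # token without its final period
        next_token = tokens[i + 1] if i + 1 < len(tokens) else ""
        rows.append({
            "token": token,
            "stem_length": len(stem),
            "stem_has_digit": any(ch.isdigit() for ch in stem),
            "stem_has_inner_period": "." in stem,
            "in_lexicon": stem.lower() in ABBREVIATION_LEXICON,
            "next_is_capitalized": next_token[:1].isupper(),
            "is_last_token": i == len(tokens) - 1,
        })
    return rows


if __name__ == "__main__":
    # Made-up German snippet in the style of a clinical note.
    for row in extract_features("Pat. klagt über Schmerzen li. Knie. Z.n. Sturz."):
        print(row)
```

In this snippet the abbreviation "li." is followed by the capitalized noun "Knie.", which shows why capitalization of the next token is a weak sentence-boundary cue in German and why lexicon-style, language-dependent features can matter. Rows like these could be vectorized and passed to a linear-kernel SVM (for example scikit-learn's `LinearSVC`), once with abbreviation labels and once with sentence-boundary labels.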
Similar resources
Unsupervised Multilingual Sentence Boundary Detection
In this article, we present a language-independent, unsupervised approach to sentence boundary detection. It is based on the assumption that a large number of ambiguities in the determination of sentence boundaries can be eliminated once abbreviations have been identified. Instead of relying on orthographic clues, the proposed system is able to detect abbreviations with high accuracy using thre...
Disambiguation of Period Characters in Clinical Narratives
The period character’s meaning is highly ambiguous due to the frequency of abbreviations that require to be followed by a period. We have developed a hybrid method for period character disambiguation and the identification of abbreviations, combining rules that explore regularities in the right context of the period with lexicon-based, statistical methods which scrutinize the preceding token. T...
Heuristic Sentence Boundary Detection and Classification
This paper explores the new methodology of detecting boundaries of the sentence by heuristic method and also classifies it. Automatic true detection of the sentence aids in semantically annotating the web. Sentences formed with URL, ellipsis and abbreviations are focus of the study. High performance features are selected for Classification using C4.5 decision trees and K-Means for clustering wi...
Unsupervised Abbreviation Detection in Clinical Narratives
Clinical narratives in electronic health record systems are a rich resource of patient-based information. They constitute an ongoing challenge for natural language processing, due to their high compactness and abundance of short forms. German medical texts exhibit numerous ad-hoc abbreviations that terminate with a period character. The disambiguation of period characters is therefore an import...
Microsoft Word - camera-ready.docx
We explore methods for effectively extracting information from clinical narratives, which are captured in a public health consulting phone service called HealthLink. The currently available data consists of dialogues constructed by nurses while consulting patients on the phone. Since the data are interviews transcribed by nurses during phone conversations, they include a significant volume and ...